Abstract

The birth and decline of disciplines are critical to science and society. 
However, no quantitative model to date allows us to validate competing 
theories of whether the emergence of scientific disciplines drives or follows 
the formation of social communities of scholars. Here we propose an 
agent-based model based on a social dynamics of science, in which the 
evolution of disciplines is guided mainly by the social interactions among 
scientists. We find that such a social theory can account for a number 
of stylized facts about the relationships between disciplines, authors, and 
publications. These results provide strong quantitative support for the key 
role of social interactions in shaping the dynamics of science. A "science 
of science" must gauge the role of exogenous events, such as scientific 
discoveries and technological advances, against this purely social baseline. 

1 Introduction 

Understanding the dynamics of science as a human endeavor — the birth, evolu- 
tion, and decline of disciplines — is of critical importance for allocating resources 
and planning toward positive societal impact. For example, the emergence of 
new fields such as bioinformatics, nanophysics, information technology, quantum 
computing, and data science promises "converging technologies" with unparal- 
leled potential to influence our lives. This paper is about modeling the dynamic 
evolution of scientific disciplines. 

Efforts to describe, explain and predict different aspects of science have
intensified in recent years [5] [3] [24], spanning a wide range of theoretical,
mathematical, statistical and computational approaches. A number of models of
the dynamics of science have explored the continually changing disciplinary
relations. Whether new disciplines or fields result from the branching of old
ones due to growth and new discoveries [23] [15], go through a
"specialization-fragmentation-hybridization" cycle [7], or grow through the
synthesis of elements of pre-existing ones [8], these models point to the
self-organizing development of science exhibiting growth and emergent behavior
[20] [25] [26].

Kuhn's cognitive theory emphasizes the role of observations not explained
by previous paradigms [12]. Other scholars [6] [27] emphasize the formation of
social groups of scientists as the driving force behind the evolution of disciplines. 
These models, however, are difficult to validate empirically or lack explanations 
of the processes leading to the empirical patterns they describe. 

Agent-based models [22] allow one to generate macroscopic predictions from 
micro-level mechanisms guiding the behavior of individuals, thus providing a
powerful approach to model the emergence of disciplines. While this approach 
has been used to create models of science dynamics [9] [4] [28] , the focus was 
primarily on coauthorship, publication, and citation behavior rather than the 
emergence of disciplines. 

Quantitative work on modeling the emergence of disciplines is lacking to 
date, owing in part to the difficulty of formally defining the notion of scientific 
field, and the consequent sparsity of data to inform and validate models. Here 
we offer a first quantitative baseline model to explore the consequences of as- 
suming a purely social basis of science dynamics, without explicit references to 
exogenous events such as scientific discoveries. In our model agents represent 
scholars who choose their collaborators, while groups of collaborating scholars 
represent scientific disciplines [21]. The key idea behind our model is that new 
scientific fields emerge from splitting and merging of these social communities. 
Our model thus defines a social dynamics of science, in which the birth and 
evolution of disciplines is guided mainly by the social interactions among sci- 
entists. We find that such a social theory can account for several stylized facts 
about the relationships between disciplines, authors, and publications. 

2 Model Description 

The critical assumption of our model is the correspondence between the social 
dynamics of scholar communities and the evolution of scientific disciplines. To 
illustrate this intuition, let us look at the coauthorship network for papers pub- 
lished by the American Physical Society (APS). Using journals as proxies for 
scholarly communities, we can track the changes in community structure over 
time. Fig. 1 plots the modularity [19] of the partition induced by the journals;
higher values indicate a more clustered structure (see Methods). 

We observe noticeable changes in modularity around the introduction of new 
journals. Some of these changes suggest a scenario in which a new field emerges 
(e.g., quantum mechanics in the late 1920's), and a new journal captures the 
corresponding scholar community, leading to an increase in modularity. Interdis- 
ciplinary interactions across established areas lead to a decrease in modularity 
(e.g., prior to the introduction of Physical Review E in the 1990's). These ob- 
servations motivate a community detection approach to model the evolution of 
disciplines.

In the proposed model, which we call SDS (for Social Dynamics of Science), 
we build a social network of collaborations whose nodes are authors, linked by 
coauthored papers as illustrated in Fig. 2(a). Each author is represented by a
list of disciplines indicating the scientific fields they have been working on, and 
every discipline has a list of papers. Similarly, each link is represented by a list 
of disciplines with associated papers describing the collaborations between two 
authors. 

There are three elements in the SDS model: papers, authors, and disciplines. 
The social network starts with one author writing one paper in one discipline. 
The network then evolves as new authors join, new papers are written, and new 
disciplines emerge over time. 

At every time step, a new paper is added to the network. Its first author is 
chosen uniformly at random, so every author has a chance to publish a paper. 
In modeling the choice of collaborators, we aim to capture a few basic intuitions: 
(i) authors who have collaborated before are likely to do so again; (ii) authors 
with common collaborators are likely to collaborate with each other; (iii) it is 
easier to choose collaborators with similar than dissimilar background; and (iv) 
authors with many collaborations have higher probability to gain additional 
ones [14] [1]. We model these behaviors through a biased random walk [13],
illustrated in Fig. 2(b). The length of the random walk determines the number
of coauthors. At each step in the walk, the author visits a node $i$ (starting
with itself) and decides to stop with probability $p_w$, or to search for
additional authors with probability $1 - p_w$. In the latter case a neighbor $j$
is selected as coauthor according to the transition probability:

$$P_{ij} = \frac{w_{ij}}{\sum_k w_{ik}} \qquad (1)$$

where $w_{ij}$ is the weight of the edge connecting authors $i$ and $j$, that is,
the number of papers that $i$ and $j$ have written together. Note that the walk
may result in a single author.
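
The walk can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `pick_coauthors` and the nested-dictionary layout of `weights` are hypothetical, and the handling of revisited nodes (they are not added twice) is an assumption, since the text does not specify it.

```python
import random

def pick_coauthors(weights, first_author, p_w, rng=random):
    """Select the authors of one paper with a biased random walk.

    weights[i][j] is the number of papers i and j have coauthored (w_ij).
    """
    authors = [first_author]
    current = first_author
    while True:
        neighbors = weights.get(current, {})
        # Stop with probability p_w, or when the current node has no neighbors.
        if not neighbors or rng.random() < p_w:
            break
        names = list(neighbors)
        # Move to neighbor j with probability w_ij / sum_k w_ik.
        current = rng.choices(names, weights=[neighbors[n] for n in names])[0]
        # Assumption: an author reached twice is listed only once.
        if current not in authors:
            authors.append(current)
    return authors
```

With `p_w = 1.0` the walk stops immediately and the paper has a single author; smaller values of `p_w` yield longer walks and therefore larger teams.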

We propose a simple mechanism to model knowledge diffusion through col- 
laboration in SDS: when authors write a paper together, they all contribute 
their knowledge. Therefore, a paper inherits the union of the author disciplines 
as topics. However, the discipline that is shared by the majority of authors is 
selected as the main topic of the paper (say, the publication venue) and diffuses 
across all the authors. Through the collaboration, authors acquire knowledge 
of and membership in this area. 
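
A minimal sketch of this diffusion rule follows. The function name `publish` and the data layout are hypothetical, the alphabetical tie-break is an assumption (the text does not say how ties are resolved), and the example memberships are invented for illustration.

```python
from collections import Counter

def publish(author_disciplines, authors):
    """Return (topics, main_topic) for a new paper and update memberships.

    author_disciplines maps each author to the set of disciplines they belong to.
    """
    # The paper inherits the union of its authors' disciplines.
    topics = set().union(*(author_disciplines[a] for a in authors))
    # The main topic is the discipline shared by the most authors.
    counts = Counter(d for a in authors for d in author_disciplines[a])
    # Assumption: ties are broken alphabetically.
    main_topic = min(counts, key=lambda d: (-counts[d], d))
    # The main topic diffuses to every coauthor.
    for a in authors:
        author_disciplines[a].add(main_topic)
    return topics, main_topic
```

For instance, with authors a and c in CS, b in CS and Math, and d in Phy, the paper acquires the topics CS, Math and Phy, its main topic is CS, and d becomes a member of CS.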

At every time step, with probability $p_n$, we add a new author to the network
with the new paper. The parameter $p_n$ regulates the ratio of papers to authors.
The new author is the first author of the new paper. To generate other col- 
laborators, an existing author is first selected uniformly at random as the first 
coauthor. Then the random walk procedure is followed to pick additional col- 
laborators. The new author acquires the main topic of the paper. 

We introduce a novel mechanism to model the evolution of disciplines by 
splitting and merging communities in the social collaboration network. The 
idea, motivated by the earlier observations from the APS data, is that the birth
or decline of a discipline should correspond to an increase in the modularity of
the network. Two such events may occur at each time step with probability $p_d$.
The process is illustrated in Fig. 3.

For a split event we select a random discipline with its coauthor network and 
decide whether a new discipline should emerge from a subset of this community. 
We partition the coauthor network into two clusters (see Methods). If the 
modularity of the partition is higher than that of the single discipline, there are 
more collaborations within each cluster than across the two. We then split the 
smaller community as a new discipline. In this case the papers whose authors 
are all in the new community are relabeled to reflect the emergent discipline. 
Borderline papers with authors in both old and new disciplines are labeled 
according to the discipline of the majority of authors. As a result, some authors
may belong to both the old and the new discipline.
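
The relabeling step of a split can be sketched as follows. The function name `relabel_after_split` and the dictionary layout are hypothetical, and leaving ties with the old discipline is an assumption, since the text only specifies a majority rule.

```python
def relabel_after_split(papers, new_members, old_label, new_label):
    """Relabel the papers of a discipline that has just been split.

    papers maps a paper id to (discipline label, list of authors);
    new_members is the set of authors in the smaller cluster.
    """
    relabeled = {}
    for pid, (label, authors) in papers.items():
        if label == old_label:
            inside = sum(a in new_members for a in authors)
            # A strict majority of authors in the new cluster moves the paper;
            # this also covers papers written entirely within the new cluster.
            # Assumption: ties stay with the old discipline.
            if 2 * inside > len(authors):
                label = new_label
        relabeled[pid] = (label, authors)
    return relabeled
```

Papers of other disciplines are returned unchanged, so the function can be applied to the full paper list after each accepted split.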

For a merge event we randomly select two disciplines with at least one com- 
mon author. If the modularity obtained by merging the two groups is higher 
than that of the partitioned groups, the collaborations across the two commu- 
nities are stronger than those within each one. The two are then merged into a 
single new discipline. In this case all the papers in the two old disciplines are 
relabeled to reflect the new one. 

3 Results 

To evaluate the predictive power of the SDS model we consider a number of 
stylized facts, i.e., broad empirical observations that describe essential charac- 
teristics of the dynamic relationships between disciplines, scholars, and publi- 
cations. Our model provides an explanation for the evolution of scientific fields 
if it can reproduce these empirical observations. The complex interactions of 
a changing group of scientists, their artifacts, and their disciplinary
aggregations can be captured by the broad empirical distributions of six
quantitative descriptors: the number of authors per paper $A_P$ (collaboration
size); the number of papers per author $P_A$ (scholar productivity); the number
of authors per discipline $A_D$ (discipline popularity); the number of
disciplines per author $D_A$ (scholar interdisciplinary effort); the number of
papers per discipline $P_D$ (discipline productivity); and the number of
disciplines per paper $D_P$ (publication breadth).
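
As an illustration, the six descriptors can be computed from three mappings. This is a sketch under assumptions: the function name `descriptors` and the input layout are hypothetical, and papers are counted toward every discipline listed among their topics.

```python
from collections import Counter

def descriptors(paper_authors, paper_topics, author_disciplines):
    """Count the six descriptors from three mappings of ids to sets.

    paper_authors: paper -> set of authors
    paper_topics: paper -> set of disciplines
    author_disciplines: author -> set of disciplines
    """
    p_a = Counter(a for authors in paper_authors.values() for a in authors)
    a_d = Counter(d for discs in author_disciplines.values() for d in discs)
    p_d = Counter(d for topics in paper_topics.values() for d in topics)
    return {
        "A_P": {p: len(s) for p, s in paper_authors.items()},       # authors per paper
        "P_A": dict(p_a),                                           # papers per author
        "A_D": dict(a_d),                                           # authors per discipline
        "D_A": {a: len(s) for a, s in author_disciplines.items()},  # disciplines per author
        "P_D": dict(p_d),                                           # papers per discipline
        "D_P": {p: len(s) for p, s in paper_topics.items()},        # disciplines per paper
    }
```

The distribution of each descriptor is then the histogram of the values in the corresponding dictionary.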

To validate the SDS model, one would ideally require a real-world dataset 
mapping the three-way relationships between scholars, publications, and disci- 
plines. Unfortunately, to the best of our knowledge, no such dataset is publicly 
available. As an alternative, we adopt three large datasets that each map a 
binary projection of these relationships: NanoBank [30] to validate the relation- 
ship between authors and papers, Scholarometer [11] to study the relationship 
between authors and disciplines, and Bibsonomy [2] to analyze the relationship 
between papers and disciplinary topics. The datasets are described in the
Methods section. The parameters $p_n$, $p_w$, and $p_d$ of our model are tuned to
fit the quantitative descriptors of each dataset (see Methods).

Fig. 4 presents a close match between the real data and the predictions of
our model. SDS reproduces the stylized facts about the relationships between
scholars, publications, and disciplines, characterized by these six
distributions. The exponential distribution of $A_P$ is captured by the random
walk process. The broad distribution of scholar productivity $P_A$ is well
accounted for by the bias in the random walk, which incorporates a kind of
preferential attachment mechanism regulated by prior collaborations. The
distributions of discipline popularity $A_D$ and productivity $P_D$ also display
heavy tails, which cannot be attributed to a specific mechanism in the model;
they emerge from the non-trivial interactions between (i) merging and splitting
of the discipline communities and (ii) knowledge diffusion from the
collaborations. The prediction is not as good for $D_A$: our model produces a
relatively large number of highly interdisciplinary authors. One could correct
this effect, for example, by requiring more than one paper in a discipline as a
condition for membership. However, this would require an additional parameter
and thus a more complicated model. Finally, the distribution of publication
breadth $D_P$ shows that there is a continuum in the breadth of papers, rather
than a sharp separation between disciplinary and interdisciplinary work.

These results focus on the relationships between disciplines, authors, and 
papers, for which there is little prior quantitative analysis. The coauthor
network, on the other hand, has been studied extensively in the past [16] [17].
As shown in Fig. 5, the SDS model generates coauthor networks whose long-tailed
degree distributions are consistent with the empirical data, as well as with those 
in the literature. 

4 Conclusions 

We introduced an agent-based model to simulate the evolution of science as a 
process driven only by social dynamics. The model captures for the first time 
major stylized facts about the complex socio-cognitive interactions of a chang- 
ing group of scholars, publications, and scientific communities. The SDS model 
is relatively simple when one considers the complexity of the science dynamics 
process being studied, yet powerful in its capability to reproduce the emergence 
of patterns similar to those observed in three real datasets about scientific pro- 
duction and fields. This provides us with strong quantitative support for the key 
role of social dynamics in shaping the birth, evolution, and decline of scientific 
disciplines. Future "science of science" studies will have to gauge the role of 
scientific discoveries, technological advances, and other exogenous events in the 
emergence of new disciplines against this purely social baseline. 

5 Methods



Modularity [19] measures the strength of a network partition into clusters of 
nodes. It compares the number of edges falling within groups with the expected 
number in an equivalent network from a null model with the same degree se- 
quence but shuffled edges. Larger values indicate stronger community structure. 
Here we consider the weighted extension of modularity. Let $w_{ij}$ be the weight
of an edge (number of coauthored papers) between nodes $i$ and $j$, and
$\bar{w}_{ij}$ its expected value. The weighted modularity is defined as:

$$Q = \frac{1}{2m} \sum_{ij} \left[ w_{ij} - \bar{w}_{ij} \right] \delta(g_i, g_j) \qquad (2)$$

where $\delta(g_i, g_j) = 1$ if $g_i = g_j$ ($i$ and $j$ are in the same group) and 0
otherwise; $m$ is the sum of all edge weights in the network. $\bar{w}_{ij}$ is
computed as:

$$\bar{w}_{ij} = \frac{s_i s_j}{2m}$$

where $s_i$ is the strength or weighted degree of node $i$, $s_i = \sum_j w_{ij}$.
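
The following sketch computes the weighted modularity of a given partition, taking the expected weight of an edge to be the product of the two node strengths divided by twice the total weight (the standard degree-preserving null model). The function name `weighted_modularity` and the edge-dictionary layout are hypothetical, and the graph is assumed to have no self-loops.

```python
def weighted_modularity(edges, group):
    """Weighted modularity Q of a partition.

    edges maps an undirected pair (i, j) to its weight, each pair listed once;
    group maps each node to its cluster label.
    """
    m = sum(edges.values())  # total edge weight
    strength = {}
    for (i, j), w in edges.items():
        strength[i] = strength.get(i, 0) + w
        strength[j] = strength.get(j, 0) + w
    # Observed term: the sum over ordered pairs counts each within-group edge
    # twice, so (1/2m) * sum_ij w_ij * delta equals within-group weight / m.
    observed = sum(w for (i, j), w in edges.items() if group[i] == group[j]) / m
    # Expected term: (1/2m) * sum_ij (s_i s_j / 2m) * delta = sum_g (S_g / 2m)^2,
    # where S_g is the total strength of group g.
    totals = {}
    for node, s in strength.items():
        totals[group[node]] = totals.get(group[node], 0) + s
    expected = sum((s / (2 * m)) ** 2 for s in totals.values())
    return observed - expected
```

For two triangles joined by a single edge, all with unit weights, placing each triangle in its own group gives Q = 6/7 - 1/2, about 0.357, while placing all six nodes in one group gives Q = 0.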

When splitting disciplines, in practice, we use the leading eigenvector 
method [18] based on the (non-weighted) modularity matrix, as an efficient 
and effective algorithm to cluster a coauthor network into two groups. 
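
A dependency-free sketch of such a bisection is given below. It is an illustration rather than the code used for the paper: the function name `leading_eigenvector_split` is hypothetical, the eigenvector is found by shifted power iteration instead of a library eigensolver, and the graph is assumed to be undirected with at least one edge.

```python
def leading_eigenvector_split(adj, iterations=500):
    """Bisect an unweighted, undirected graph using the leading eigenvector
    of its modularity matrix B_ij = A_ij - k_i * k_j / (2m).

    adj maps each node to the set of its neighbors.
    """
    nodes = sorted(adj)
    n = len(nodes)
    degree = [len(adj[u]) for u in nodes]
    two_m = sum(degree)
    B = [[(1.0 if v in adj[u] else 0.0) - degree[a] * degree[b] / two_m
          for b, v in enumerate(nodes)] for a, u in enumerate(nodes)]
    # Shift B by a Gershgorin bound so that power iteration converges to the
    # eigenvector of the largest eigenvalue, not the largest in magnitude.
    shift = max(sum(abs(x) for x in row) for row in B)
    # Deterministic, non-uniform start (the uniform vector is an eigenvector of B).
    x = [float(a + 1) for a in range(n)]
    for _ in range(iterations):
        y = [sum(B[a][b] * x[b] for b in range(n)) + shift * x[a] for a in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Nodes are assigned to a group according to the sign of their component.
    positive = {u for u, v in zip(nodes, x) if v > 0}
    return positive, set(nodes) - positive
```

The returned partition is only a candidate: in the split event described earlier, it is accepted only if it increases the modularity.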

The APS dataset (Fig. 1) was made available by the American Physical
Society (publish.aps.org/datasets/). We consider the papers appearing in
eight journals during the period 1913-2000: Physical Review (PR) 1913-1955,
Reviews of Modern Physics (RMP) 1929-2000, Physical Review Letters (PRL)
1958-2000, Physical Review A, B, C, D (PRA-D) 1970-2000, and Physical
Review E (PRE) 1993-2000.

The SDS model is validated against three datasets: 

NanoBank (version Beta 1, released in May 2007) [30] is a digital library of
bibliographic data on articles, patents and grants in the field of nanotech- 
nology. A set of nanotechnology-related articles in NanoBank has been 
selected from the Science Citation Index Expanded, Social Sciences Ci- 
tation Index, and Arts and Humanities Citation Index produced by the 
Institute for Scientific Information (now Thomson Reuters). Two docu- 
ment selection criteria have been used in the creation of NanoBank [29] : 
(i) articles that contain some of the 379 terms identified by subject spe- 
cialists as being "nano-specific," and (ii) articles based on a probabilistic 
procedure for the automatic identification of terms. The database cov- 
ers a 35-year period (1970-2004). This dataset is used to validate the 
relationship between authors and papers. 

Scholarometer (scholarometer.indiana.edu) is a social tool for scholarly 
services developed at Indiana University, with the goal of exploring the 
crowdsourcing approach for disciplinary annotations and cross-disciplinary 
impact metrics [10] [11]. Users provide discipline annotations (tags) for
queried authors, which in turn are used to compare author impact across
disciplinary boundaries. The data collected by Scholarometer is available 
via an open API. We use this data to study the relationship between 
authors and disciplines. 



Bibsonomy (www.bibsonomy.org) is a system for sharing bookmarks and lists
of literature [2]. Users annotate papers with tags. The dataset is free for
research purposes. We downloaded a dump as of 2012-01-01 to analyze 
the relationship between papers and disciplines. To filter noise from many 
junk annotations, we removed the tags associated with fewer than 3 papers 
or more than 6,000 papers. 
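
The tag filter can be sketched as follows. The function name `filter_tags` and the input layout are hypothetical; papers left without any tag are kept with an empty set, which is an assumption.

```python
from collections import Counter

def filter_tags(paper_tags, min_papers=3, max_papers=6000):
    """Drop tags attached to fewer than min_papers or more than max_papers papers.

    paper_tags maps a paper id to the set of tags users assigned to it.
    """
    usage = Counter(tag for tags in paper_tags.values() for tag in tags)
    keep = {t for t, n in usage.items() if min_papers <= n <= max_papers}
    return {p: tags & keep for p, tags in paper_tags.items()}
```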

The SDS model has three parameters: $p_n$ controls the number of papers per
author; $p_w$ controls the number of authors per paper; finally, $p_d$ is the
frequency of network split and merge events, and controls the number of
disciplines. To generate predictions from numerical simulations of the model, we
first tune these parameters to fit the properties of the three empirical
datasets individually. Table 1 reports the main properties of the datasets and
the matching model parameters. As shown in Table 2, the SDS model is capable of
approximating the basic statistics of the empirical data.

Figures and Tables

Figure 1: Modularity Q of APS journal-induced scholar communities. For each
year t, we build a coauthor network based on the papers published in the 5-year
time interval between t - 2 and t + 2. Such a network snapshot consists only
of active authors, who published at least one paper in that time window. If an 
author published papers in more than one journal, we select the first journal 
in that period. The grey areas correspond to the introduction of major new 
journals. 

Figure 2: (a) Illustration of the social network structure. Nodes and edges
represent authors and their collaborations. They are annotated with lists of
(co)authored papers grouped by scientific fields. For example, author b has five
papers including four in computer science (CS) and one in Math. Papers 1 and
2 are coauthored with a, papers 5 and 6 with c, and paper 5 with d. Paper 4 is
authored by b alone. (b) Illustration of the random walk mechanism to select
authors. For the new paper 7, the first author a is chosen randomly and then 
walks to b and c, stopping at d. These four authors become connected to each 
other if they have not collaborated before; for example, new edges connect a 
to c and d. Paper 7 acquires topics CS, Math and Physics (Phy). The main 
(majority) field of the paper, CS, diffuses across the coauthors, including d who 
joins this discipline as a result. 

Figure 3: Discipline evolution. (a) The coauthor network of discipline D1 is
split into two disciplines D2 and D3. The modularity increases from Q = 0
to Q = 0.4. The dashed line indicates the partition of the network suggested
by the community detection algorithm. Some nodes in the new discipline D3
have also published papers with authors in D2, and therefore belong to both
disciplines. (b) Two coauthor networks of disciplines D4 and D5 are merged
into new discipline D6. For authors in both original disciplines, we pick one
based on the number of papers published in each discipline. The dashed line
shows the resulting partition, with very low modularity Q = -0.1. The merged
community D6 still has low, but higher, modularity Q = 0.

Figure 4: Stylized facts characterizing relationships between authors, papers,
and disciplines. We plot the distributions of (a) authors per paper, (b) papers 
per author, (c) authors per discipline, (d) disciplines per author, (e) papers per 
discipline, and (f) disciplines per paper. Blue circles represent the SDS predic- 
tions, while red symbols represent the empirical data from the three datasets. 
The results of the model are averaged over 10 runs. 

Figure 5: Degree distribution of the coauthor network generated by the SDS
model, compared to the empirical distribution from the Bibsonomy dataset. A 
similar match is also observed for other datasets (not shown). A few papers 
with more than 100 authors were excluded as they generate an anomaly in the 
tail; each such paper generates at least 100 nodes with degree at least 100. 



Table 1: Dataset properties and tuning of SDS model parameters. For each 
dataset we run the simulations until the empirical number of papers or authors 
is reached (shown in bold). 





                        NanoBank      Scholarometer   Bibsonomy
Number of papers        2.7 x 10^5    1.4 x 10^6      2.9 x 10^5
Number of authors       2.9 x 10^5    2.2 x 10^4      3.2 x 10^5
Number of disciplines   n/a           1.1 x 10^3      4.4 x 10^4
p_n                     0.90          0.04            0.80
p_w                     0.28          0.35            0.71
p_d                     n/a           0.01            0.50



Table 2: Basic statistics of empirical datasets compared with SDS model
predictions. The reported values are averages, and standard deviations are
obtained from 10 realizations of the model.



Quantity   Dataset         Empirical   SDS
A_P        NanoBank        4.006       4.011 ± 0.001
P_A        NanoBank        3.666       4.456 ± 0.005
A_D        Scholarometer   45          60 ± 10
D_A        Scholarometer   2.2         3.5 ± 0.4
P_D        Bibsonomy       24          22 ± 1
D_P        Bibsonomy       3.6         3.3 ± 0.2